ST310 Course Project

The aim of this project is to demonstrate the use of machine learning techniques learnt in the ST310 Machine Learning module. We rely on both linear (Logistic Regression) and non-linear methods (Random Forests and Gradient Boosted Decision Trees) for a classification problem.

Remark: If executing this notebooks as a .Rmd file, ensure that all libraries are installed and that the cells are executed in sequential order.

Dataset

We have obtained data from the Kaggle March Tabular Playground competition. The data consists of anonymised features, which correspond to a binary outcome variable; in other words the task is a classification problem. Although the data is anonymised, we are told that it is based on a real dataset that deals with predicting the amount of an insurance claim.

We consider the ROC-AUC metric (Receiving Operator Characteristic Area Under the Curve), and accuracy metric. Given the problem is indirectly one related to insurance, the modelling outcome of interest is not only to obtain the correct predictions (accuracy) , but also accurate probabilities, which is what the AUC metric measures.The test accuracy

library(tidyverse)
library(ggplot2)
library(tidymodels)
library(xgboost)
library(catboost)
library(tidyverse)
library(broom)
library(ggplot2)
library(arm)
library(car) #outlier
library(tidyr)
library(glmnet)
library(caret)
library(tidymodels)
library(pROC)
library(randomForest)
library(e1071)
library(arm)
library(data.table)
df <- read.csv(file = "../data/train.csv")

The dataset consists of 30 features, 19 categorical variables and 11 numerical variables., and a binary outcome variable target. We will need to process these categorical variables in some manner, which we shall consider in the subsequent section.

A challenge is that we have no specific domain knowledge about the meaning of particular features, and whether they may be useful. Nonetheless, we can resort to model interpretability methods (Christoph Molnar 2019) - in the linear case, we can examine coefficients and their statistical significance, and in the case of tree-based models, we can utilise feature importances, partial dependence plots, and shapley values.

Preprocessing

We first remove the first column id, which represents an unique identifier for each observation. We then convert the first 19 columns, which are categorical variables into the factor data type in R, To reduce computational time, we subsample 4000 observations from the complete dataset of 300,000 observations. For our subsequent analysis, we will use only this subsample of 4000 observations, although the approaches can be extended to a larger dataset given greater computational resources. We further partition the 4000 observations, into a train set of 3000 observations and a test set of 1000 observations. All training and tuning are performed on the training data, and we will consider model performance on the test set performance scores.

rownames(df) = df$id
df = df[,-1] # remove id column
cat_feats = 1:19 # the first 19 columns are categorical 
cont_feats <- 20:30 # and the next 11 are continuous
df[,cat_feats] <- lapply(df[,cat_feats], as.factor)
df$target <- as.factor(df$target)
# subsample the data for faster model imputation
set.seed(1)
sam = sample(1:nrow(df), 4000)
df_sample = df[sam,]
# Partition data into train and test; test will be our oos data
set.seed(1)
df_split <- initial_split(df_sample, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)

Exploratory Data Analysis

Univariate EDA

We observe that the univariate distributions of the continuous variables are all multi-modal and non-normal, but they are all normalised to the range of \([0, 1]\).

df %>% pivot_longer(cols = starts_with("cont"), names_to  = "cont") %>% 
   ggplot(aes(x = value))+
   geom_histogram(bins = 100, alpha = 0.85)+
   ggtitle("Continuous features distribution")+
   facet_wrap(cont~.,scales = "free") +
   theme_minimal()

From the distributions of categorical variables, we see that there are variables with substantially more observations in one category, and also variables with a high number of (\(>50\)) categories, which will be an issue to address in our preprocessing step.

df %>% pivot_longer(cols = contains(c("cat")), names_to  = "cat") %>% 
   ggplot(aes(x = value))+
   geom_bar(alpha = 0.85)+
   ggtitle("Categorical features distribution")+
   facet_wrap(cat~.,scales = "free")+
   theme_minimal()

Bivariate EDA

We group the continuous variables by the target and plot them as boxplots to check for any obvious differences discernible by eye. From the plots, claims with target == 1 have lower values of cont3 on average (median) than claims with target == 0, hence we would expect a negative relationship between cont3 and target. Claims with target=1 also have a higher median value of cont4 than claims with target == 0. In addition, we see a group of observations at the tail ends for cont0, cont5, cont7, cont8, cont9, cont10, which may be indicative of outliers. This will be addressed in a subsequent section.

cont.stacked <- gather(data=df[, c(cont_feats, 31)],-target,key="var",value="value")
p.cont <- ggplot(cont.stacked,aes(x=target,y=value),fill=factor(value)) + geom_boxplot() + coord_flip() + facet_wrap(~var, scales="free_x")
p.cont

For categorical variables, we use stacked bar plots to show the percentages of observations in each category that correspond to target == 0 and target == 1 respectively. A much larger proportion of claims with cat13 == B appear to be associated with target == 1 compared to cat13 == A. On the other hand, a much larger percentage of claims correspond to target == 0 if cat18 is A or B, than if cat18 is C or D.

g1 <- c(1:10,31)
g2 <- c(11:19,31)
cate.stacked1 <- gather(data=df[,g1],-target,key="var",value="value")
p.cate1 <- ggplot(cate.stacked1,aes(x=value,fill=target)) + geom_bar(position="fill") + scale_y_continuous(name = "Within group Percentage", labels = scales::percent) + facet_wrap(~var, scales="free_x")
cate.stacked2 <- gather(data=df[,g2],-target,key="var",value="value")
p.cate2 <- ggplot(cate.stacked2,aes(x=value,fill=target)) + geom_bar(position="fill") + scale_y_continuous(name = "Within group Percentage", labels = scales::percent) + facet_wrap(~var, scales="free_x")
p.cate1

p.cate2

We also inspect the correlation matrix for our continuous variables. There seems to be a cluster of variables - cont1, cont2, cont8 - that are highly correlated with each other. This could lead to problems with multicollinearity when using linear models.

df_num <- df[, cont_feats]
cor_matrix <- cor(df_num)
heatmap(cor_matrix)

Finally, we use Principal Components Analysis (PCA) as a means to visualise the data in low-dimension, to determine if there are any explicitly discernible trends. By eye, the classes do not appear to be linearly seperable - which suggest a non-linear method may be more effective. There do not appear to be any significant difference in the PCA representations for each class.

pcs <- prcomp(df[,cont_feats])
set.seed(2021)
ind <- sample(1:dim(df)[1], 20000)
sample <- data.frame(pcs$x[ind,1], pcs$x[ind,2], df[ind, "target"])
names(sample) = c("pc1", "pc2", "y")
ggplot(sample) + geom_jitter(aes(x = pc1, y = pc2, colour = factor(y)),alpha=0.7) + ggtitle('Principal Components')

Source: https://statsandr.com/blog/outliers-detection-in-r/

Modelling

We first consider a naive ‘basline,’ in which we let all predicted labels equal 0 or 1 (whichever leads to the greater accuracy; the accuracy would be the greater of either \(\hat{y}\) or \(1 - \hat{y}\)). In this case, the greater accuracy of \(0.745\) would be obtained if we predict all labels as 0, which suggests the 1 class is less frequently occurring i.e. imbalanced classification. This illustrates the issue with the accuracy metric: we must compare our model accuracy with the naive accuracy and other appropriate benchmarks to prevent a misleading result, as our. On the other hand, predicting all 1s or 0s would result in the worst possible ROC-AUC score of \(0.5\).

# train-test
X_train = as.matrix(df_train[,grepl("cont", colnames(df_train))])
y_train = as.numeric(as.matrix(df_train$target))
X_test = as.matrix(df_test[,grepl("cont", colnames(df_train))])
y_test = as.numeric(as.matrix(df_test$target))
results = data.frame(train_acc = max(1 - mean(y_train), mean(y_train)),
                     train_auc = 0.5,
                     test_acc = max(1 - mean(y_test), mean(y_test)),
                     test_auc = 0.5,
                     row.names=c("naive")
                     )
results

Logistic Regression (Few Predictors)

As a baseline model, we build a simple logistic regression with a few predictors, which are selected from our exploratory data analysis to have discernible differences in target. These are cont3, cont4, cat13, cat18. We use the glm package to fit a logistic regression.

glm1 <- glm(target~cont3+cont4+cat13+cat18,data=df_train, family=binomial(link="logit"),control = list(maxit = 100))
summary(glm1)

Call:
glm(formula = target ~ cont3 + cont4 + cat13 + cat18, family = binomial(link = "logit"), 
    data = df_train, control = list(maxit = 100))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5707  -0.6702  -0.6021   0.3931   2.0095  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4821     1.0756  -1.378  0.16824    
cont3        -0.6554     0.2276  -2.880  0.00398 ** 
cont4        -0.4772     0.2063  -2.313  0.02070 *  
cat13B        1.8681     0.3346   5.584 2.36e-08 ***
cat18B        0.5581     1.0714   0.521  0.60243    
cat18C        2.6081     1.0816   2.411  0.01590 *  
cat18D        2.4342     1.0805   2.253  0.02427 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3406.6  on 2999  degrees of freedom
Residual deviance: 2962.8  on 2993  degrees of freedom
AIC: 2976.8

Number of Fisher Scoring iterations: 4

All four predictors are significant at 5% level, with the most significant predictors being… The coefficients of the dummy variables indicate the average difference between the log odds of that factor level group compared to the baseline level group. In our regression, the first category (A) of each categorical variable is taken as the baseline group. For example, the coefficient of cat13b implies the following equation: \[\text{logit} P(\text{target}=1|\text{cat13b} = 1) – \text{logit} P(\text{target}=0|\text{cat13b} = 1) = 1.8681. \] In words, this means that claims with cat13b = 1 have 33% higher log odds of target=1 than claims with [catA]. However, we can only assert that there is some association - in this case it is statistically significant relationship with the p-value being 2.36e-08 - between the presence of the variable cat13b = 1 and the target being 1, but not whether there is a causation; in particular, given the lack of knowledge of what the variable represents, we cannot form any meaningful hypothesis.

To interpret the coefficients of the continuous variables, risk ratios need to be calculated using specific pairs of values of the predictors. The non-linearity of the logistic function means that a change of one unit in the value of a predictor is not the same across the range of the predictor.

pred_train <- predict(glm1, df_train, type="response")
pred_test <- predict(glm1, df_test, type="response")

diagnosis <- function(train_pred, test_pred, train_true, test_true){
  train_classes <- ifelse(train_pred > 0.5, 1,0)
  test_classes <- ifelse(test_pred > 0.5, 1,0)
  acc1 <- mean(train_classes == train_true)
  auc1 <- auc(roc(train_true, train_pred))
  acc2 <- mean(test_classes == test_true)
  auc2 <- auc(roc(test_true, test_pred))
  data.frame(train_acc=acc1, train_auc=auc1, test_acc = acc2, test_auc=auc2)
}

results["glm-small",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["glm-small",]

SGD

To fulfill the project requirements, we demonstrate a `from scratch’ Stochastic Gradient Descent routine for logistic regression. The deriviation follows (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009, 120–26)

The logistic loss with a \(\mathcal{L}^{2}\) penalty is given by:

\[l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} y_{i} log(p(x_{i} ; \boldsymbol{\beta})) + (1 - y_{i}) log(1 - p(x_{i} ; \boldsymbol{\beta})) = \sum_{i = 1}^{N} \left [ y_{i}log \left (\frac{p(x_{i} ; \boldsymbol{\beta})}{1 - p(x_{i} ; \boldsymbol{\beta})} \right) + log(1 - p(x_{i} ; \boldsymbol{\beta})) \right ]\]

\[ l(\boldsymbol{\beta})= -\sum_{i = 1}^{N}\left [y_{i} \boldsymbol{\beta}^{T} x_{i} - log(1 + exp(\boldsymbol{\beta}^{T}x_{i})) \right ]\] Then with the inclusion of the reularisation term:

\[ l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [y_{i} \boldsymbol{\beta}^{T} x_{i} - log(1 + exp(\boldsymbol{\beta}^{T}x_{i})) \right ] - \lambda \boldsymbol{\beta}^{T} \boldsymbol{\beta}\]

The gradient is given by:

\[\nabla(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [ y_{i} x_{i} - \frac{x_{i}exp(\boldsymbol{\beta}^{T} x_{i})}{1 + exp(\boldsymbol{\beta}^{T} x_{i})} \right] - \lambda 2\boldsymbol{\beta}\]

We use the Barzilai-Borwein method (Murphy, Kevin P 2012, 444–45) to determine the step size.

# binary crossentropy / log-loss
log_loss <- function(x, y, betas, lambda){
  logits <- x %*% betas
  - (t(y) %*% logits - sum(log(1 + exp(logits))) + lambda * t(betas) %*% betas) / dim(x)[1]
}

# logistic regression gradients
gradients <- function(x, y, betas, lambda){
  logits <- x %*% betas
  - (t(x) %*% (y - exp(logits)/(1 + exp(logits)))) - lambda *2 * betas / dim(x)[1]
}

p = dim(X_train)[2]

lambda = 0
n_iters <- 100
init_step_size <- 1e-6

set.seed(2021)
beta_init <- matrix(rnorm(p),nrow=p)
beta_path <- matrix(rep(0, n_iters * p), nrow = n_iters, ncol=p)
beta_path[1,] = beta_init

last_grad <- grad <- gradients(X_train, y_train, beta_path[1,], lambda)
beta_path[2,] = beta_init - init_step_size * grad
grad <- gradients(X_train, y_train, beta_path[2,], lambda)

losses <- rep(0, n_iters)

for (i in 3:n_iters){
    step_size <- as.numeric(t(beta_path[i - 1,] - beta_path[i - 2,]) %*% (grad - last_grad) / 
                    (t(grad - last_grad) %*% (grad - last_grad)))
    beta_path[i,] <- beta_path[i - 1,] - step_size * grad
    last_grad <- grad
    grad <- gradients(X_train, y_train, beta_path[i, ], lambda)
    losses[i] <- log_loss(X_train, y_train, beta_path[i,], lambda)
}
ggplot(data.frame(step = 3:n_iters, loss=losses[3:n_iters])) + 
  geom_line(aes(x = step, y = loss)) +
  ggtitle("Binary Crossentropy vs. Iterations")

pred_train <- as.numeric(1 / (1 + exp(-X_train %*% beta_path[100,])))
pred_test <- as.numeric(1 / (1 + exp(-X_test %*% beta_path[100,])))
results["sgd",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["sgd",]

Logistic Regression (All Predictors)

We now run a logistic regression using all variables, using the glm package.

glm2 <- glm(target~., data=df_train, family=binomial(link="logit"), 
            control = list(maxit = 100))
pred_train <- predict(glm2, df_train, type="response")
glm2$xlevels = lapply(df[,cat_feats], levels)
pred_test <- predict(glm2, df_test, type="response")
results["glm-full",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["glm-full",]
anova(glm1, glm2, test="Chisq")
Analysis of Deviance Table

Model 1: target ~ cont3 + cont4 + cat13 + cat18
Model 2: target ~ cat0 + cat1 + cat2 + cat3 + cat4 + cat5 + cat6 + cat7 + 
    cat8 + cat9 + cat10 + cat11 + cat12 + cat13 + cat14 + cat15 + 
    cat16 + cat17 + cat18 + cont0 + cont1 + cont2 + cont3 + cont4 + 
    cont5 + cont6 + cont7 + cont8 + cont9 + cont10
  Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
1      2993       2963                      
2      2552      42964 441   -40001         

Outlier Detection

At this stage, we seek to remove outliers spotted in the bivariate plots in our exploratory analysis for cont0, cont5, cont7, cont8, cont9 and cont10. To investigate this, we identified observations that lie at the extreme percentiles of these variables. Using the Hampel filter, which considers points lying outside the median plus or minus 3 mean absolute deviations as outliers, more than 200 observations per variable were classified as such. This suggests that these may not be outliers but that the distribution is just heavy-tailed. Without further information on the reasonable scale of values that individual variables can take (along with the fact that they are all normalised), we find it challenging to clearly identify outliers through descriptive statistics and decide not to exclude any such points using this approach. We will proceed to detect for outliers in another approach later on.

hampel_filter <- function(df){
   lower_bound <- median(df) - 3 * mad(df, constant = 1)
   upper_bound <- median(df) + 3 * mad(df, constant = 1)
   outlier_ind <- which(df < lower_bound | df > upper_bound)
   return(outlier_ind)
}
percentile_filter <- function(df, lq = 0.001, uq = 0.999){
   lower_bound <- quantile(df, lq)
   upper_bound <- quantile(df, uq)
   outlier_ind <- which(df < lower_bound | df > upper_bound)
   return(outlier_ind)
}
hampel_count <- function(x){length(hampel_filter(x))}
pct_count <- function(x){length(percentile_filter(x))}

outlier_counts <- data.frame(lapply(df_train[, cont_feats], hampel_count))
outlier_counts[2, ] <- data.frame(lapply(df_train[, cont_feats], pct_count))
outlier_counts

We attempt to detect outliers under a logistic regression. Formally, outliers are defined as observations with a response vector that is unusual conditional on covariates. They are formally identified through studentized residuals. Intuitively, outliers have large residuals and we can formally test (by looking at the Bonferroni–adjusted p-values) if these residuals are significantly larger than the other observations. We use the outlierTest function from the car to determine the outliers from the glm model.

Observations that are far from the average covariate pattern are considered to have high leverage and can be measured using the hat value. Here, there are many points with high leverage.

outlierTest(glm2)
outliers <- as.numeric(names(outlierTest(glm2)$p))

Finally, we measure for influence, which is an observation that is an outlier and have high leverage. These are likely to influence the regression coefficients and influence can be thought of as the product of leverage and outlier. Here, we plot studentised residuals against hat-values with the size of a circle being proportional to the Cook’s distance of an observation- a measure of influence.

influenceIndexPlot(glm2, vars = "hat")

influencePlot(glm2)

Here, we observe that there are a number of observations with high influence - outliers with high leverage. Thus, we remove these observations and compare the performance of our updated model with the original model.

In the later models, we will also exclude the same outliers.

influencers <- as.numeric(rownames(influencePlot(glm2)))
glm2_influencers <- update(glm2, subset = c(-influencers))
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred
glm2_outliers <- update(glm2, subset = c(-outliers))
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred

Logistic + Ridge

Given the issue of high dimensionality, we consider a regularised form of logistic regression. We use the glmnet package; in doing so we need to convert the data type into matrices.The glmnet package requires the data to be in a matrix data type, and hence we make the corresponding adjustment.

X_train = df_train[, -length(df_train)]
y_train <- df_train$target
X_test = df_test[, -length(df_test)]
y_test <- df_test$target
X_train = model.matrix(~., X_train)
X_test = model.matrix(~., X_test)

glm3 <- cv.glmnet(X_train, y_train, 
                  family="binomial"(link="logit"), alpha=0)
glm3

Call:  cv.glmnet(x = X_train, y = y_train, family = binomial(link = "logit"),      alpha = 0) 

Measure: GLM Deviance 

    Lambda Index Measure      SE Nonzero
min 0.1422    80  0.7846 0.01822     458
1se 0.3606    70  0.8010 0.01636     458
pred_train <- as.numeric(predict(glm3, X_train, type="response"))
pred_test <- as.numeric(predict(glm3, X_test, type="response"))
results["glm-ridge",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["glm-ridge",]

Nonlinear Fit (Tree-based)

Random Forest

To improve performance, we draw on the usage of non-linear tree-based models, specifically random forests. Intuitively, a random forest averages different decision trees (known as bagging) so as to reduce the variance of individual trees.

control <- trainControl(method = "cv",
    number = 2,
    search = "grid")

ptm=proc.time()
set.seed(1)
rf1 <- train(target~.,
    data = df_train,
    method = "rf",
    metric = "Accuracy",
    trControl=control, 
    importance=T,
    maxnodes=128,
    ntree=64)
pred_train <- predict(rf1,df_train,type='prob')[,2]
pred_test <- predict(rf1, df_test, type='prob')[,2]
results["rf",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["rf", ]

The random forest has a test accuracy of 0.836 and a test AUC of 0.872, which is higher than the previous linear models. However, the train accuracy and AUC are significantly higher than the test performance, which suggest that there may be some degree of overfitting to the train data.

Random forests can be used to rank the importance of different features. Specifically, the x-axis is the Mean Decrease Accuracy, which reports how much accuracy the model loses when we exclude this variable. The more the accuracy falls by, the more important the particular variable is. Note here that for categorical variables, each level of the category is classified as a single variable. In this plot, we recorded the 30 most important variables.

We then run another random forest with the most important features. In doing so, we hope to reduce the degree of overfitting by reducing the complexity of the model. However, it does not make sense to drop some levels of a categorical variable while including the other levels. Hence, as long as a level is present in the top 30 features, we will include the entire category in our updated random forest model. This results in us keeping only 22 variables, from an initial 30.

Our reduced random forest has a slightly improved test accuracy and a slightly decreased test AUC. However, it does not solve the potential problem of overfitting as train performance is still significantly better than test performance. In fact, train performance on the reduced random forest is better than the random forest with a full set of variables - the reduced complexity of the model enabled it to have a lower bias on the training set.

XGBoost

For completeness, we also consider the xgboost library for gradient boosted decision trees. Gradient Boosted Decision Trees. The XGBoost package, introduced in (Chen, Tianqi and Guestrin, Carlos 2016) is a variant of Gradient Boosted Decision Trees (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009, 353–74)

dmy_train <- dummyVars("~.", data = df_train[,-length(df_train)])
dmy_test <- dummyVars("~.", data = df_test[,-length(df_test)])
X_train <- as.matrix(data.frame(predict(dmy_train,df_train)))
X_test <- as.matrix(data.frame(predict(dmy_test,df_test)))
y_train = as.integer(as.matrix(df_train$target))
y_test = as.integer(as.matrix(df_test$target))
bst <- xgboost(data = X_train, label=y_train, max_depth = 2, nround = 10, 
               verbose=0,
               objective='binary:logistic',
               eval_metric="logloss")

pred_train <- predict(bst, X_train, type="response")
pred_test <- predict(bst, X_test, type="response")
results["xgb",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["xgb",]

importance_matrix <- xgb.importance(model=bst)
xgb.plot.importance(importance_matrix)

# shapley values
xgboost::xgb.ggplot.shap.summary(X_test, model = bst, 
                                 target_class = 1, top_n = 20)  # Summary plot

CatBoost

We also consider briefly examine the catboost library for gradient boosted decision trees, given its popularity on machine learning competitions such as Kaggle. The CatBoost library has the advantage of learning a target encoding for categorical variables. From an implementation point of view, this may reduce the preprocessing that may be required.For details on CatBoost, refer to (Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey 2017)

X_train = df_train[,c(cat_feats, cont_feats)]
y_train = as.integer(df_train$target)
X_test = df_test[,c(cat_feats, cont_feats)]
y_test = as.integer(df_test$target)

pool <- catboost.load_pool(X_train, y_train, cat_features = cat_feats)
model <- catboost.train(pool, params=list(depth = 8, iterations = 10, 
                                          loss_function='Logloss', verbose=0))

pred_train <- catboost.predict(model, catboost.load_pool(X_train), prediction_type = 'Probability')
pred_test <- catboost.predict(model, catboost.load_pool(X_test), prediction_type = 'Probability')

data_shap_tree <- catboost.get_feature_importance(model, pool = pool,
                                                  type = "ShapValues")
data_shap_tree <- data.frame(data_shap_tree[, -ncol(data_shap_tree)]) 
names(data_shap_tree) = names(df[, c(cat_feats, cont_feats)])


ggplot(stack(data_shap_tree), aes(x = ind, y = values)) +
    geom_point(aes(color = values)) + coord_flip() + ggtitle("Shapely Values for each variable")


results["catboost",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["catboost",]
feat_importance <- catboost.get_feature_importance(model, pool)
importances <- data.frame(feat_importance[order(feat_importance, decreasing=FALSE),])
importances$features = rownames(importances)
names(importances) <- c("importance","features")
importances$features <- factor(importances$features, level=importances$features)
ggplot(importances, aes(x=importance, y=features)) + geom_bar(stat="identity")

Results

results

Bibliography

Chen, Tianqi and Guestrin, Carlos. 2016. “XGBoost: A Scalable Tree Boosting System.” https://arxiv.org/pdf/1603.02754.pdf.
Christoph Molnar. 2019. “Interpretable Machine Learning.” “https://christophm.github.io/interpretable-ml-book”.
Efron, Bradley and Hastie, Trevor. 2016. “Computer Age Statistical Inference.” Cambridge University Press.
Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. “An Introduction To Statistical Learning.” Springer.
Murphy, Kevin P. 2012. “Machine Learning: A Probabilistic Perspective.” MIT Press. https://christophm.github.io/interpretable-ml-book.
Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey. 2017. “CatBoost: Unbiased Boosting With Categorical Features.” https://arxiv.org/pdf/1706.09516.pdf.
---
title: "ST310 Course Project"
author: "Chris Chia, Mun Fai Chan, Zhen Yen Chan"
date: "`r format(Sys.time(), '%d/%m/%y')`"
output:
  html_notebook:
    toc: true
  html_document:
    toc: true
    df_print: paged
    keep_md: true
  pdf_document: default
header-includes:
- \usepackage {hyperref}
- \hypersetup {colorlinks = true, linkcolor = blue, urlcolor = blue}
references:
- id: hastie2009elements
  title: 'The Elements of Statistical Learning: Data Mining, Inference, and Prediction'
  author: Hastie, T. and Tibshirani, R. and Friedman, J.H.
  publisher: Springer
  type: book
  issued:
    year: 2009
- id: james2017introduction
  title: "An Introduction To Statistical Learning"
  author: James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani
  publisher: Springer
  issued:
    year: 2017
- id: efron2016computer
  title: "Computer Age Statistical Inference"
  author: Efron, Bradley and Hastie, Trevor
  publisher: Cambridge University Press
  issued:
    year: 2016
- id: molnar2019
  title: "Interpretable Machine Learning"
  author: Christoph Molnar
  issued:
    year: 2019
  URL: “https://christophm.github.io/interpretable-ml-book”
- id: murphy2012machine
  title: "Machine Learning: A Probabilistic Perspective"
  author: Murphy, Kevin P
  publisher: MIT Press
  issued:
    year: 2012
  URL: "https://christophm.github.io/interpretable-ml-book"
- id: prokhorenkova2017catboost
  title: "CatBoost: Unbiased Boosting With Categorical Features"
  author: Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey
  issued:
    year: 2017
  URL: "https://arxiv.org/pdf/1706.09516.pdf"
- id: chen2016xgboost
  title: "XGBoost: A Scalable Tree Boosting System"
  author: Chen, Tianqi and Guestrin, Carlos
  issued:
    year: 2016
  URL: "https://arxiv.org/pdf/1603.02754.pdf"
---

# ST310 Course Project

The aim of this project is to demonstrate the use of machine learning techniques learnt in the ST310 Machine Learning module. We rely on both linear (Logistic Regression) and non-linear methods (Random Forests and Gradient Boosted Decision Trees) for a classification problem.

**Remark**: If executing this notebooks as a `.Rmd` file, ensure that all libraries are installed and that the cells are executed in sequential order.

## Dataset

We have obtained data from the [Kaggle March Tabular Playground](#https://www.kaggle.com/c/tabular-playground-series-mar-2021/overview) competition. The data consists of *anonymised features*, which correspond to a binary outcome variable; in other words the task is a **classification problem**. Although the data is anonymised, we are told that it is based on a real dataset that deals with predicting the amount of an insurance claim.

We consider the **ROC-AUC metric** (Receiving Operator Characteristic Area Under the Curve), and **accuracy** metric. Given the problem is indirectly one related to insurance, the modelling outcome of interest is not only  to obtain the correct predictions (accuracy) , but also accurate probabilities, which is what the AUC metric measures.The test accuracy

```{r load-libs, warning=FALSE, message=FALSE}
library(tidyverse)
library(ggplot2)
library(tidymodels)
library(xgboost)
library(catboost)
library(tidyverse)
library(broom)
library(ggplot2)
library(arm)
library(car) #outlier
library(tidyr)
library(glmnet)
library(caret)
library(tidymodels)
library(pROC)
library(randomForest)
library(e1071)
library(arm)
library(data.table)
df <- read.csv(file = "../data/train.csv")
cat(c("The dimensions of the array are :", dim(df)[1], ", ", dim(df)[2]))
```
The dataset consists of 30 features,  19 *categorical variables* and 11 numerical variables., and a binary outcome variable `target`. We will need to process these categorical variables in some manner, which we shall consider in the [subsequent section](#preprocessing).

A challenge is that we have no *specific* domain knowledge about the meaning of particular features, and whether they may be useful. Nonetheless, we can resort to model interpretability methods [@molnar2019] - in the linear case, we can examine coefficients and their statistical significance, and in the case of tree-based models, we can utilise feature importances, partial dependence plots, and shapley values.

## Preprocessing

We first remove the first column `id`, which represents an unique identifier for each observation. We then convert the first 19 columns, which are categorical variables into the *factor* data type in R, To reduce computational time, we subsample 4000 observations from the complete dataset of 300,000 observations. For our subsequent analysis, we will use only this subsample of 4000 observations, although the approaches can be extended to a larger dataset given greater computational resources.  We further partition the 4000 observations, into a train set of 3000 observations and a test set of  1000 observations. All training and tuning are performed on the training data, and we will consider model performance on the test set performance scores. 

```{r partition}
rownames(df) = df$id
df = df[,-1] # remove id column
cat_feats = 1:19 # the first 19 columns are categorical 
cont_feats <- 20:30 # and the next 11 are continuous
df[,cat_feats] <- lapply(df[,cat_feats], as.factor)
df$target <- as.factor(df$target)
# subsample the data for faster model imputation
set.seed(1)
sam = sample(1:nrow(df), 4000)
df_sample = df[sam,]
# Partition data into train and test; test will be our oos data
set.seed(1)
df_split <- initial_split(df_sample, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)
```

## Exploratory Data Analysis

#### Univariate EDA
We observe that the univariate distributions of the *continuous* variables are all multi-modal and non-normal, but they are all normalised to the range of $[0, 1]$.
```{r cont-viz}
df %>% pivot_longer(cols = starts_with("cont"), names_to  = "cont") %>% 
   ggplot(aes(x = value))+
   geom_histogram(bins = 100, alpha = 0.85)+
   ggtitle("Continuous features distribution")+
   facet_wrap(cont~.,scales = "free") +
   theme_minimal()
```
From the distributions of categorical variables, we see that there are variables with substantially more observations in one category, and also variables with a high number of ($>50$) categories, which will be an issue to address in our preprocessing step.  
```{r cat-viz}
df %>% pivot_longer(cols = contains(c("cat")), names_to  = "cat") %>% 
   ggplot(aes(x = value))+
   geom_bar(alpha = 0.85)+
   ggtitle("Categorical features distribution")+
   facet_wrap(cat~.,scales = "free")+
   theme_minimal()
```



### Bivariate EDA

We group the continuous variables by the target and plot them as boxplots to check for any obvious differences discernible by eye. From the plots, claims with `target == 1` have lower values of cont3 on average (median) than claims with `target == 0`, hence we would expect a negative relationship between cont3 and target. Claims with target=1 also have a higher median value of cont4 than claims with `target == 0`. In addition, we see a group of observations at the tail ends for `cont0, cont5, cont7, cont8, cont9, cont10`, which may be indicative of outliers. This will be addressed in a [subsequent section](#logistic-regression).

```{r cont-by-target}
cont.stacked <- gather(data=df[, c(cont_feats, 31)],-target,key="var",value="value")
p.cont <- ggplot(cont.stacked,aes(x=target,y=value),fill=factor(value)) + geom_boxplot() + coord_flip() + facet_wrap(~var, scales="free_x")
p.cont
```
For categorical variables, we use stacked bar plots to show the percentages of observations in each category that correspond to `target == 0` and `target == 1` respectively. A much larger proportion of claims with `cat13 == B` appear to be associated with `target == 1` compared to `cat13 == A`. On the other hand, a much larger percentage of claims correspond to `target == 0` if `cat18` is `A` or `B`, than if `cat18` is `C` or `D`.

```{r cat-by-target, warning=FALSE, message=FALSE}
g1 <- c(1:10,31)
g2 <- c(11:19,31)
cate.stacked1 <- gather(data=df[,g1],-target,key="var",value="value")
p.cate1 <- ggplot(cate.stacked1,aes(x=value,fill=target)) + geom_bar(position="fill") + scale_y_continuous(name = "Within group Percentage", labels = scales::percent) + facet_wrap(~var, scales="free_x")
cate.stacked2 <- gather(data=df[,g2],-target,key="var",value="value")
p.cate2 <- ggplot(cate.stacked2,aes(x=value,fill=target)) + geom_bar(position="fill") + scale_y_continuous(name = "Within group Percentage", labels = scales::percent) + facet_wrap(~var, scales="free_x")
p.cate1
p.cate2
```

We also inspect the correlation matrix for our *continuous* variables. There seems to be a cluster of variables - `cont1, cont2, cont8` - that are highly correlated with each other. This could lead to problems with multicollinearity when using linear models.

```{r cor-mat}
df_num <- df[, cont_feats]
cor_matrix <- cor(df_num)
heatmap(cor_matrix)
```
Finally, we use Principal Components Analysis (PCA) as a means to visualise the data in low-dimension, to determine if there are any explicitly discernible trends. By eye, the classes do not appear to be linearly seperable - which suggest a non-linear method may be more effective. There do not appear to be any significant difference in the PCA representations for each class.

```{r pca-viz}
pcs <- prcomp(df[,cont_feats])
set.seed(2021)
ind <- sample(1:dim(df)[1], 20000)
sample <- data.frame(pcs$x[ind,1], pcs$x[ind,2], df[ind, "target"])
names(sample) = c("pc1", "pc2", "y")
ggplot(sample) + geom_jitter(aes(x = pc1, y = pc2, colour = factor(y)),alpha=0.7) + ggtitle('Principal Components')
cumul_var <- cumsum(pcs$sdev^2 / sum(pcs$sdev^2))
ggplot(data.frame(feature = 1:11, cumul_var = cumul_var)) + geom_line(aes(x = feature,y = cumul_var))
```
Source: https://statsandr.com/blog/outliers-detection-in-r/

## Modelling

We first consider a naive 'basline', in which we let all predicted labels equal 0 or 1 (whichever leads to the greater accuracy; the accuracy would be the greater of either $\hat{y}$ or $1 - \hat{y}$). In this case, the greater accuracy of $0.745$ would be obtained if we predict all labels as `0`, which suggests the `1` class is less frequently occurring i.e. imbalanced classification. This illustrates the issue with the accuracy metric: we must compare our model accuracy with the *naive* accuracy and other appropriate benchmarks to prevent a misleading result, as our. On the other hand, predicting all 1s or 0s would result in the worst possible ROC-AUC score of $0.5$.

```{r}
# train-test
X_train = as.matrix(df_train[,grepl("cont", colnames(df_train))])
y_train = as.numeric(as.matrix(df_train$target))
X_test = as.matrix(df_test[,grepl("cont", colnames(df_train))])
y_test = as.numeric(as.matrix(df_test$target))
results = data.frame(train_acc = max(1 - mean(y_train), mean(y_train)),
                     train_auc = 0.5,
                     test_acc = max(1 - mean(y_test), mean(y_test)),
                     test_auc = 0.5,
                     row.names=c("naive")
                     )
results
```

### Logistic Regression (Few Predictors)

As a baseline model, we build a simple logistic regression with a few predictors, which are selected from our exploratory data analysis to have discernible differences in target. These are `cont3, cont4, cat13, cat18`. We use the `glm` package to fit a logistic regression.

```{r}
glm1 <- glm(target~cont3+cont4+cat13+cat18,data=df_train, family=binomial(link="logit"),control = list(maxit = 100))
summary(glm1)
```

All four predictors are significant at 5% level, with the most significant predictors being... The coefficients of the dummy variables indicate the average difference between the log odds of that factor level group compared to the baseline level group. In our regression, the first category (A) of each categorical variable is taken as the baseline group. For example, the coefficient of `cat13b` implies the following equation: 
$$\text{logit} P(\text{target}=1|\text{cat13b} = 1) – \text{logit} P(\text{target}=0|\text{cat13b} = 1) = 1.8681. $$
In words, this means that claims with `cat13b = 1` have 33% higher log odds of target=1 than claims with [catA]. However, we can only assert that there is some  association - in this case it is statistically significant relationship with the p-value being `2.36e-08` - between the presence of the variable `cat13b = 1` and the target being 1, but not whether there is a causation; in particular, given the lack of knowledge of what the variable represents, we cannot form any meaningful hypothesis.

To interpret the coefficients of the continuous variables, risk ratios need to be calculated using specific pairs of values of the predictors. The non-linearity of the logistic function means that a change of one unit in the value of a predictor is not the same across the range of the predictor.

```{r message=FALSE, warning=FALSE}
pred_train <- predict(glm1, df_train, type="response")
pred_test <- predict(glm1, df_test, type="response")

diagnosis <- function(train_pred, test_pred, train_true, test_true){
  train_classes <- ifelse(train_pred > 0.5, 1,0)
  test_classes <- ifelse(test_pred > 0.5, 1,0)
  acc1 <- mean(train_classes == train_true)
  auc1 <- auc(roc(train_true, train_pred))
  acc2 <- mean(test_classes == test_true)
  auc2 <- auc(roc(test_true, test_pred))
  data.frame(train_acc=acc1, train_auc=auc1, test_acc = acc2, test_auc=auc2)
}

results["glm-small",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["glm-small",]
```

### SGD

To fulfill the project requirements, we demonstrate a `from scratch' Stochastic Gradient Descent routine  for logistic regression. The deriviation follows [@hastie2009elements, pp. 120-126]

The logistic loss with a \( \mathcal{L}^{2}\) penalty is given by:

$$l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} y_{i} log(p(x_{i} ; \boldsymbol{\beta})) + (1 - y_{i}) log(1 - p(x_{i} ; \boldsymbol{\beta})) = \sum_{i = 1}^{N} \left [ y_{i}log \left (\frac{p(x_{i} ; \boldsymbol{\beta})}{1 - p(x_{i} ; \boldsymbol{\beta})} \right) + log(1 - p(x_{i} ; \boldsymbol{\beta})) \right ]$$

$$ l(\boldsymbol{\beta})= -\sum_{i = 1}^{N}\left [y_{i} \boldsymbol{\beta}^{T} x_{i} - log(1 + exp(\boldsymbol{\beta}^{T}x_{i})) \right ]$$
Then with the inclusion of the reularisation term:

$$ l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [y_{i} \boldsymbol{\beta}^{T} x_{i} - log(1 + exp(\boldsymbol{\beta}^{T}x_{i})) \right ] - \lambda \boldsymbol{\beta}^{T} \boldsymbol{\beta}$$


The gradient is given by:

$$\nabla(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [ y_{i} x_{i} - \frac{x_{i}exp(\boldsymbol{\beta}^{T} x_{i})}{1 + exp(\boldsymbol{\beta}^{T} x_{i})} \right] - \lambda 2\boldsymbol{\beta}$$

We use the Barzilai-Borwein method [@murphy2012machine, pp. 444-445] to determine the step size.

```{r gradient-descent}
# binary crossentropy / log-loss
log_loss <- function(x, y, betas, lambda){
  logits <- x %*% betas
  - (t(y) %*% logits - sum(log(1 + exp(logits))) + lambda * t(betas) %*% betas) / dim(x)[1]
}

# logistic regression gradients
gradients <- function(x, y, betas, lambda){
  logits <- x %*% betas
  - (t(x) %*% (y - exp(logits)/(1 + exp(logits)))) - lambda *2 * betas / dim(x)[1]
}

p = dim(X_train)[2]

lambda = 0
n_iters <- 100
init_step_size <- 1e-6

set.seed(2021)
beta_init <- matrix(rnorm(p),nrow=p)
beta_path <- matrix(rep(0, n_iters * p), nrow = n_iters, ncol=p)
beta_path[1,] = beta_init

last_grad <- grad <- gradients(X_train, y_train, beta_path[1,], lambda)
beta_path[2,] = beta_init - init_step_size * grad
grad <- gradients(X_train, y_train, beta_path[2,], lambda)

losses <- rep(0, n_iters)

for (i in 3:n_iters){
    step_size <- as.numeric(t(beta_path[i - 1,] - beta_path[i - 2,]) %*% (grad - last_grad) / 
                    (t(grad - last_grad) %*% (grad - last_grad)))
    beta_path[i,] <- beta_path[i - 1,] - step_size * grad
    last_grad <- grad
    grad <- gradients(X_train, y_train, beta_path[i, ], lambda)
    losses[i] <- log_loss(X_train, y_train, beta_path[i,], lambda)
}
ggplot(data.frame(step = 3:n_iters, loss=losses[3:n_iters])) + 
  geom_line(aes(x = step, y = loss)) +
  ggtitle("Binary Crossentropy vs. Iterations")
```


```{r message=FALSE, warning=FALSE}
pred_train <- as.numeric(1 / (1 + exp(-X_train %*% beta_path[100,])))
pred_test <- as.numeric(1 / (1 + exp(-X_test %*% beta_path[100,])))
results["sgd",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["sgd",]
```

### Logistic Regression (All Predictors)

We now run a logistic regression using all variables, using the `glm` package.
```{r message=FALSE, warning=FALSE, output=FALSE}
glm2 <- glm(target~., data=df_train, family=binomial(link="logit"), 
            control = list(maxit = 100))
# display(glm2)
```

```{r glm-full-results, message=FALSE, warning=FALSE}
pred_train <- predict(glm2, df_train, type="response")
glm2$xlevels = lapply(df[,cat_feats], levels)
pred_test <- predict(glm2, df_test, type="response")
results["glm-full",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["glm-full",]
anova(glm1, glm2, test="Chisq")
```
### Outlier Detection

 At this stage, we seek to remove outliers spotted in the bivariate plots in our exploratory analysis for cont0, cont5, cont7, cont8, cont9 and cont10. To investigate this, we identified observations that lie at the extreme percentiles of these variables. Using the Hampel filter, which considers points lying outside the median plus or minus 3 mean absolute deviations as outliers, more than 200 observations per variable were classified as such. This suggests that these may not be outliers but that the distribution is just heavy-tailed. Without further information on the reasonable scale of values that individual variables can take (along with the fact that they are all normalised), we find it challenging to clearly identify outliers through descriptive statistics and decide not to exclude any such points using this approach. We will proceed to detect for outliers in another approach later on. 

```{r}
hampel_filter <- function(df){
   lower_bound <- median(df) - 3 * mad(df, constant = 1)
   upper_bound <- median(df) + 3 * mad(df, constant = 1)
   outlier_ind <- which(df < lower_bound | df > upper_bound)
   return(outlier_ind)
}
percentile_filter <- function(df, lq = 0.001, uq = 0.999){
   lower_bound <- quantile(df, lq)
   upper_bound <- quantile(df, uq)
   outlier_ind <- which(df < lower_bound | df > upper_bound)
   return(outlier_ind)
}
hampel_count <- function(x){length(hampel_filter(x))}
pct_count <- function(x){length(percentile_filter(x))}

outlier_counts <- data.frame(lapply(df_train[, cont_feats], hampel_count))
outlier_counts[2, ] <- data.frame(lapply(df_train[, cont_feats], pct_count))
outlier_counts
```


We attempt to detect outliers under a logistic regression. Formally, outliers are defined as observations with a response vector that is unusual conditional on covariates. They are formally identified through studentized residuals. Intuitively, outliers have large residuals and we can formally test (by looking at the Bonferroni--adjusted p-values) if these residuals are significantly larger than the other observations. We use the `outlierTest` function from the `car` to determine the outliers from the `glm` model.

Observations that are far from the average covariate pattern are considered to have high leverage and can be measured using the hat value. Here, there are many points with high leverage. 

```{r}
outlierTest(glm2)
outliers <- as.numeric(names(outlierTest(glm2)$p))
```

Finally, we measure for influence, which is an observation that is an outlier and have high leverage. These are likely to influence the regression coefficients and influence can be thought of as the product of leverage and outlier. Here, we plot studentised residuals against hat-values with the size of a circle being proportional to the Cook's distance of an observation- a measure of influence.


```{r}
influenceIndexPlot(glm2, vars = "hat")
```

```{r}
influencePlot(glm2)
```
Here, we observe that there are a number of observations with high influence - outliers with high leverage. Thus, we remove these observations and compare the performance of our updated model with the original model. 

In the later models, we will also exclude the same outliers. 

```{r influencers}
influencers <- as.numeric(rownames(influencePlot(glm2)))
glm2_influencers <- update(glm2, subset = c(-influencers))
glm2_outliers <- update(glm2, subset = c(-outliers))
removal_list <- union(outliers, influencers)
glm2_removed <- update(glm2, subset = c(-removal_list))
compareCoefs(glm2, glm2_influencers, glm2_outliers, glm2_removed)
# actually just use glm2 and glm2_removed
```



### Logistic + Ridge

Given the issue of *high dimensionality*, we consider a regularised form of logistic regression. We use the `glmnet` package; in doing so we need to convert the data type into matrices.The `glmnet` package requires the data to be in a *matrix* data type, and hence we make the corresponding adjustment.

```{r message=FALSE, output=FALSE}
X_train = df_train[, -length(df_train)]
y_train <- df_train$target
X_test = df_test[, -length(df_test)]
y_test <- df_test$target
X_train = model.matrix(~., X_train)
X_test = model.matrix(~., X_test)

glm3 <- cv.glmnet(X_train, y_train, 
                  family="binomial"(link="logit"), alpha=0)
glm3
```


```{r glm-ridge-scores, message=FALSE, warning=FALSE}
pred_train <- as.numeric(predict(glm3, X_train, type="response"))
pred_test <- as.numeric(predict(glm3, X_test, type="response"))
results["glm-ridge",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["glm-ridge",]
```

### Nonlinear Fit (Tree-based)

#### Random Forest

To improve performance, we draw on the usage of non-linear tree-based models, specifically random forests. Intuitively, a random forest averages different decision trees (known as bagging) so as to reduce the variance of individual trees.

```{r}
control <- trainControl(method = "cv",
    number = 2,
    search = "grid")

ptm=proc.time()
set.seed(1)
rf1 <- train(target~.,
    data = df_train,
    method = "rf",
    metric = "Accuracy",
    trControl=control, 
    importance=T,
    maxnodes=128,
    ntree=64)
time1=proc.time()-ptm
time1
#ls(rf1)
varImpPlot(rf1$finalModel, n.var=30, type=1)
```
```{r rf-scores, message=FALSE, warning=FALSE}
pred_train <- predict(rf1,df_train,type='prob')[,2]
pred_test <- predict(rf1, df_test, type='prob')[,2]
results["rf",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["rf", ]
```

The random forest has a test accuracy of 0.836 and a test AUC of 0.872, which is higher than the previous linear models. However, the train accuracy and AUC are significantly higher than the test performance, which suggest that there may be some degree of overfitting to the train data. 

Random forests can be used to rank the importance of different features. Specifically, the x-axis is the Mean Decrease Accuracy, which reports how much accuracy the model loses when we exclude this variable. The more the accuracy falls by, the more important the particular variable is. Note here that for categorical variables, each level of the category is classified as a single variable. In this plot, we recorded the 30 most important variables. 

We then run another random forest with the most important features. In doing so, we hope to reduce the degree of overfitting by reducing the complexity of the model. However, it does not make sense to drop some levels of a categorical variable while including the other levels. Hence, as long as a level is present in the top 30 features, we will include the entire category in our updated random forest model. This results in us keeping only 22 variables, from an initial 30. 

Our reduced random forest has a slightly improved test accuracy and a slightly decreased test AUC. However, it does not solve the potential problem of overfitting as train performance is still significantly better than test performance. In fact, train performance on the reduced random forest is better than the random forest with a full set of variables - the reduced complexity of the model enabled it to have a lower bias on the training set. 

#### XGBoost

For completeness, we also consider the `xgboost` library for **gradient boosted decision trees**. Gradient Boosted Decision Trees. The XGBoost package, introduced in [@chen2016xgboost] is a variant of Gradient Boosted Decision Trees [@hastie2009elements, pp. 353-374]


```{r xgb, message=FALSE, warning=FALSE}
dmy_train <- dummyVars("~.", data = df_train[,-length(df_train)])
dmy_test <- dummyVars("~.", data = df_test[,-length(df_test)])
X_train <- as.matrix(data.frame(predict(dmy_train,df_train)))
X_test <- as.matrix(data.frame(predict(dmy_test,df_test)))
y_train = as.integer(as.matrix(df_train$target))
y_test = as.integer(as.matrix(df_test$target))
bst <- xgboost(data = X_train, label=y_train, max_depth = 2, nround = 10, 
               verbose=0,
               objective='binary:logistic',
               eval_metric="logloss")

pred_train <- predict(bst, X_train, type="response")
pred_test <- predict(bst, X_test, type="response")
results["xgb",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["xgb",]

importance_matrix <- xgb.importance(model=bst)
xgb.plot.importance(importance_matrix)
# shapley values
xgboost::xgb.ggplot.shap.summary(X_test, model = bst, 
                                 target_class = 1, top_n = 20)  # Summary plot
```

#### CatBoost

We also consider briefly examine the `catboost` library for **gradient boosted decision trees**, given its popularity on machine learning competitions such as Kaggle. The CatBoost library has the advantage of learning a target encoding for categorical variables. From an implementation point of view, this may reduce the preprocessing that may be required.For details on CatBoost, refer to [@prokhorenkova2017catboost]

```{r cb, message=FALSE, warning=FALSE}
X_train = df_train[,c(cat_feats, cont_feats)]
y_train = as.integer(df_train$target)
X_test = df_test[,c(cat_feats, cont_feats)]
y_test = as.integer(df_test$target)

pool <- catboost.load_pool(X_train, y_train, cat_features = cat_feats)
model <- catboost.train(pool, params=list(depth = 8, iterations = 10, 
                                          loss_function='Logloss', verbose=0))

pred_train <- catboost.predict(model, catboost.load_pool(X_train), prediction_type = 'Probability')
pred_test <- catboost.predict(model, catboost.load_pool(X_test), prediction_type = 'Probability')

data_shap_tree <- catboost.get_feature_importance(model, pool = pool,
                                                  type = "ShapValues")
data_shap_tree <- data.frame(data_shap_tree[, -ncol(data_shap_tree)]) 
names(data_shap_tree) = names(df[, c(cat_feats, cont_feats)])


ggplot(stack(data_shap_tree), aes(x = ind, y = values)) +
    geom_point(aes(color = values)) + coord_flip() + ggtitle("Shapely Values by variable")

results["catboost",] <- diagnosis(pred_train, pred_test, df_train$target, df_test$target)
results["catboost",]
```

```{r}
feat_importance <- catboost.get_feature_importance(model, pool)
importances <- data.frame(feat_importance[order(feat_importance, decreasing=FALSE),])
importances$features = rownames(importances)
names(importances) <- c("importance","features")
importances$features <- factor(importances$features, level=importances$features)
ggplot(importances, aes(x=importance, y=features)) + geom_bar(stat="identity")
```

## Results

```{r}
results
results[order(-results$test_auc, -results$test_acc),]
```

## Bibliography


